Fix LWIP failure after 256 Ethernet disconnects #3212
Thanks to Yohine. He identified via email a leak of DHCP state that would cause LWIP to panic() after 256 disconnects. Properly clean up DHCP state on link ::end (shutdown).
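To illustrate why a leak like this can surface at exactly 256 cycles, here is a minimal, self-contained sketch. It assumes the leaked DHCP state is tracked by an 8-bit reference counter that wraps to zero while the shared resource is still allocated; the thread does not confirm that mechanism. `FakeDhcp` and `firstFailingCycle` are hypothetical names for illustration only, not LWIP or arduino-pico APIs.

```cpp
#include <cstdint>

// Hypothetical stand-in for a DHCP client whose shared state is guarded
// by an 8-bit reference count (assumption, not the real LWIP code).
struct FakeDhcp {
    uint8_t refcount = 0;      // wraps back to 0 after 256 increments
    bool pcbAllocated = false; // shared resource owned while refcount > 0

    // Returns false at the point where a real stack would assert/panic.
    bool start() {
        if (refcount == 0) {
            if (pcbAllocated) {
                return false;  // counter wrapped while the resource is still live
            }
            pcbAllocated = true;
        }
        refcount++;
        return true;
    }

    // Releases one reference; frees the resource when the last one goes away.
    void stop() {
        if (refcount == 0) {
            return;
        }
        if (--refcount == 0) {
            pcbAllocated = false;
        }
    }
};

// Runs begin/end cycles and returns the 1-based cycle on which start()
// fails, or 0 if every cycle succeeds.
int firstFailingCycle(int cycles, bool cleanupOnEnd) {
    FakeDhcp dhcp;
    for (int i = 1; i <= cycles; i++) {
        if (!dhcp.start()) {
            return i;
        }
        if (cleanupOnEnd) {
            dhcp.stop();       // models cleaning up DHCP state on ::end
        }
    }
    return 0;
}
```

In this model, `firstFailingCycle(1000, false)` returns 257: 256 starts succeed without cleanup, and the next one fails. With `cleanupOnEnd` set to `true` the counter never climbs, so no cycle fails.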
Thank you for the quick correction and feedback.
@yohine with this plus #3213 I got 3301 WiFi.begin()/WiFi.end() cycles overnight with no leaks. Unfortunately, at the 3302nd loop the CYW43 chip started timing out and stopped responding to messages from the CYW43 driver running on the Pico. Debugging with GDB, I can see the driver try to send packets to the CYW43, and it doesn't respond within the timeout. So AFAICT the binary blob running on the 2nd ARM chip has hung/died/something at this point. Nothing we can do about that here since it's completely opaque.
I've been testing for a few days now. While I haven't reached a final conclusion yet, after making the same changes as in your #3213, the problem appears to have been resolved in my environment. Deleting netif_remove() from netif_add() no longer causes the panic.
However, upon careful inspection, there appear to be slight differences in our fixes. These may or may not be related, but I'll introduce them for testing.
@LwipIntfDev::begin
@cyw43_spi_transfer() in cyw43_bus_pio_spi.c
1 difference is trivial. I just cleaned up that ugly, ugly switch. I think originally it had more to it, but you can clearly see it's doing a {netif_remove/return false} on dhcp_start != ERR_OK. So I cleaned that up as I was going, because I don't want to end up on TheDailyWTF.
2 diff in your case, I think, has a race condition. Because networking is IRQ driven, it would be possible for you to get an IRQ right after
3 diff, if you found a bug in the CYW43 driver then please do post something on Pico-SDK to get the fix for everyone. I don't modify the upstream SDK for this core, at all, for sanity's sake.
(Also, unless you reran, I think the real diff may just be in testing methods. I was banging
The reason for the different results could be the test content, or it could be related to the PIO or hardware. At this point, I don't think there's a clear answer. In my long-term testing, the complete stoppage has not recurred even once.
Instead, I've found another serious problem, which I'm currently investigating. This is a phenomenon where the reconnection status gets stuck at 6 after exceeding a certain number of connection attempts or time. Status 3 indicates a successful connection, but it remains at 6. Disconnecting the AP returns it from 6 to 4, but the same problem persists afterward. I haven't yet determined the specific time or number of attempts, and I'm still collecting data.
It's unclear whether this is related to the current problem, but if my PIO timeout is triggered, that might be the cause. The behavior of the CYW43 after disconnection is undefined. However, at this point, it's only a possibility.
Regarding point 3: If it's ultimately confirmed to be a PIO problem, I will report it on the Pi Forum. However, the current information is probably not enough to convince them. Based on my experience, they are unlikely to trust my report.
In my environment, all necessary sources have been removed from the static library, and everything is compiled locally. I've confirmed that creating an infinite loop in the local function of cyw43_bus_pio_spi results in a correct stop. For example, I used a command like this:
Unfortunately, I have other work to do, so I won't be able to test for about a week. Therefore, the resumption of the above retesting will be after that. However, I plan to continue investigating this problem until I can solve it or until I give up.